Finding duplicates in a data stream

Authors

  • Parikshit Gopalan
  • Jaikumar Radhakrishnan
Abstract

Given a data stream of length n over an alphabet [m] where n > m, we consider the problem of finding a duplicate in a single pass. We give a randomized algorithm for this problem that uses O((log m)^3) space. This answers a question of Muthukrishnan [Mut05] and Tarui [Tar07], who asked if this problem could be solved using sub-linear space and one pass over the input. Our algorithm solves the more general problem of finding a positive frequency element in a stream given by frequency updates where the sum of all frequencies is positive. Our main tool is an Isolation Lemma that reduces this problem to the task of detecting and identifying a Dictatorial variable in a Boolean halfspace. We present various relaxations of the condition n > m, under which one can find duplicates efficiently.
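The link between the duplicate problem and the frequency-update problem can be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: it stores exact counts in linear space, whereas the abstract's result is a randomized small-space method. The function names `duplicate_stream_to_updates` and `find_positive_frequency_exact` are hypothetical, and the reduction shown (start every symbol at frequency -1, then add +1 per stream item) is one natural way to read the abstract's generalization: the frequencies then sum to n - m > 0, and any symbol with positive frequency occurred at least twice.

```python
from collections import Counter
from typing import Iterable, Iterator, Optional, Tuple


def duplicate_stream_to_updates(
    stream: Iterable[int], m: int
) -> Iterator[Tuple[int, int]]:
    """Turn a stream over [m] = {1, ..., m} into (item, delta) updates.

    Assumed reduction: each symbol starts at -1, and each occurrence
    in the stream adds +1, so the total frequency is n - m.
    """
    for symbol in range(1, m + 1):
        yield (symbol, -1)
    for item in stream:
        yield (item, +1)


def find_positive_frequency_exact(
    updates: Iterable[Tuple[int, int]]
) -> Optional[int]:
    """Linear-space baseline: return some item with positive frequency.

    This keeps one counter per symbol, so it does NOT achieve the
    sub-linear space bound discussed in the abstract.
    """
    freq: Counter = Counter()
    for item, delta in updates:
        freq[item] += delta
    for item, f in freq.items():
        if f > 0:
            return item
    return None


if __name__ == "__main__":
    stream = [3, 1, 4, 1, 2]  # n = 5 items over alphabet [4], so n > m
    updates = list(duplicate_stream_to_updates(stream, m=4))
    print(sum(delta for _, delta in updates))      # total frequency n - m
    print(find_positive_frequency_exact(updates))  # a duplicated symbol
```

Because n > m forces the total frequency to be positive, at least one symbol must end with positive frequency, which is why a duplicate always exists in this setting.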

Similar resources

Dynamically Adaptive Count Bloom Filter for Handling Duplicates in Data Stream

Identifying and removing duplicates in Data Stream applications is one of the primary challenges in traditional duplicate elimination techniques. It is not feasible in many streaming scenarios to eliminate precisely the occurrence of duplicates in an unbounded data stream. However, existing variants of the Bloom filter cannot support dynamic in both filter and counter together. In this paper we...

STREAM CORRIDORS AS INVALUABLE URBAN ELEMENTS: SUGGESTIONS FOR IMPROVEMENT OF PAVEH STREAM

  The study seeks to address the importance of urban stream ecosystems from the perspective of urban ecology, human health and social well-being in the context of urban planning. The case study area is Paveh stream in the City of Paveh. The data from the case study area were gathered from questionnaire, existing scientific and library studies and by conducting interviews with residents and auth...

Discriminative Identification of Duplicates

The problem of finding duplicates in data is ubiquitous in data mining. We cast the problem of finding duplicates in sequential data into a poly-cut problem on a fully connected graph. The edge weights can be identified with parameterized pairwise similarities between objects that are optimized by structural support vector machines on labeled training sets. Our approach adapts the similarity me...

A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Accuracy and validity of data are prerequisites for the proper operation of any software system. There is always a possibility of errors in data due to human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real world entity. There must be one of them in a data source, but for some reasons like aggregation of ...


Publication date: 2009